Refactor V1 timeout handling#1543
Conversation
ApprovabilityVerdict: Needs human review This PR significantly refactors timeout handling across the V1 system, removing many configurable timeout parameters and introducing a new framework-managed timeout architecture. Multiple unresolved P1 review comments identify scenarios where previously bounded operations may now hang indefinitely, requiring careful human assessment of the behavioral changes. You can customize Macroscope's approvability policy. Learn more. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87149fb579
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "--rollout-timeout", | ||
| dest="rollout_timeout_seconds", | ||
| type=float, | ||
| default=None, | ||
| help="Per-rollout wall-clock timeout in seconds. Overrides timeout_seconds in --extra-env-kwargs.", | ||
| help="Per-rollout wall-clock timeout in seconds.", |
There was a problem hiding this comment.
Keep --timeout as a compatibility alias
For users and config files that still use the documented prime eval run ... --timeout 600 / TOML timeout = 600 timeout surface, this rename makes argparse reject the flag before it can be mapped to rollout_timeout_seconds. Since this was the public per-rollout timeout entrypoint for existing eval workflows, removing it turns previously valid runs into CLI failures rather than applying the equivalent rollout timeout; keep --timeout as a deprecated alias to the same destination and normalize legacy TOML timeout as well.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be71389db5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| result = await client.run_background_job( | ||
| sandbox_id=sandbox_id, | ||
| command="bash test.sh", | ||
| working_dir="/tests", | ||
| timeout=test_timeout, | ||
| ) |
There was a problem hiding this comment.
Preserve Harbor verifier timeouts
For Harbor tasks that set [verifier].timeout_sec, this path now drops the parsed test_timeout and starts bash test.sh without passing any timeout, so a verifier that hangs or intentionally relies on its task-specific budget can block scoring until some unrelated client/global timeout instead of being capped per the task TOML. Keep carrying the parsed verifier timeout into this call or map it to the new timeout surface.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37953fc57f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b5fe6754f7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| sandbox = await asyncio.wait_for(create_waiter, SANDBOX_CREATE_TIMEOUT_SECONDS) | ||
| except (asyncio.CancelledError, TimeoutError): | ||
| try: | ||
| sandbox = cast(SandboxRecord, await asyncio.shield(create_task)) |
There was a problem hiding this comment.
Don't await the timed-out sandbox creation task
When sandbox creation exceeds SANDBOX_CREATE_TIMEOUT_SECONDS, wait_for only cancels the shield wrapper, not create_task; the except TimeoutError path then awaits the same shielded create_task before it can re-raise or clean up. In an environment where the sandbox create RPC hangs, this new internal timeout therefore never returns and can still block rollout/global timeout handling indefinitely; bound or cancel/close the underlying create attempt instead of awaiting it unbounded after the timeout.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 72f59c6984
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| harbor = task["harbor"] | ||
| assert isinstance(harbor, dict) | ||
| task_dir = Path(str(harbor["task_dir"])) | ||
| test_timeout = test_phase_timeout_seconds(harbor.get("test_timeout_seconds")) |
There was a problem hiding this comment.
Restore a default Harbor verifier timeout
When a Harbor task omits [verifier].timeout_sec, this now resolves test_timeout to None and passes that through to run_background_job, whereas the previous path always fell back to verifier_timeout_seconds=900. In evals without an explicit rollout/global timeout, a stuck bash test.sh can now block scoring indefinitely for any task that relies on the default verifier budget; keep a finite default when the task TOML does not specify one.
Useful? React with 👍 / 👎.
# Conflicts: # packages/harnesses/harnesses/rlm.py # tests/test_v1_rlm_swe.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: adde70aec2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| fi | ||
| cd "$MINI_SWE_AGENT_WORKDIR" | ||
| timeout --kill-after=30s "${{AGENT_TIMEOUT_SECONDS:-3600}}" {shlex.quote(DEFAULT_MINI_BINARY)} \\ | ||
| {shlex.quote(DEFAULT_MINI_BINARY)} \\ |
There was a problem hiding this comment.
Keep MiniSWE agent runs bounded
With the default v1 environment settings, task_timeout_seconds is None, so this sandbox command is now sent to run_background_job without any timeout, and the shell-level timeout --kill-after ... ${AGENT_TIMEOUT_SECONDS:-3600} wrapper was removed here. If mini-swe-agent hangs or never exits despite --exit-immediately, evaluations using MiniSWEAgentProgramConfig can now block indefinitely instead of being capped at the previous 3600s default; keep a finite default task timeout or restore a command-level wrapper for this harness.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8ab88fdf3c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if "timeout" in config and "rollout_timeout_seconds" not in config: | ||
| config["rollout_timeout_seconds"] = config["timeout"] |
There was a problem hiding this comment.
Preserve per-eval legacy timeout overrides
When a TOML config sets a global rollout_timeout_seconds and an individual [[eval]] still uses the legacy timeout field, this normalization runs after {**global_defaults, **eval_config} and skips copying timeout because rollout_timeout_seconds is already present. That silently leaves the global rollout timeout in effect instead of the per-eval override, so existing configs cannot override a global timeout for one eval unless they rename the field everywhere; normalize each layer before merging or otherwise give the per-eval legacy field precedence.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 181e9ef. Configure here.
| completed = True | ||
| except FrameworkTimeoutError: | ||
| state["timed_out"] = True | ||
| state.stop("timeout_reached") |
There was a problem hiding this comment.
Rollout timeout skips scoring
Medium Severity
When the rollout-level timeout_after fires, Harness.run sets timed_out and calls stop but never runs score_rollout. Error rollouts still score in the same method, so timed-out rollouts can finish without reward or metrics.
Reviewed by Cursor Bugbot for commit 181e9ef. Configure here.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 181e9ef40a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "agent_timeout" | ||
| if isinstance(exc, TimeoutError) | ||
| else "agent_recursion_limit" | ||
| result = await agent.ainvoke( |
There was a problem hiding this comment.
Preserve Wikispeedia timeout metric
When the Wikispeedia agent exceeds the new env.task_timeout_seconds, the timeout is raised and handled by Harness.local_callable_program rather than this try block, so the state keeps the initialized agent_timeout = False. The environment's agent_timeout metric reads that field, so timed-out Wikispeedia rollouts now report agent_timeout as 0 even though the agent hit the timeout; either mark this state on framework timeouts or update the metric to use task_timed_out.
Useful? React with 👍 / 👎.


Summary
Tests
New structure:
flowchart TB subgraph Inputs["Configurable eval inputs"] CLI["CLI<br/>--global-timeout<br/>--rollout-timeout<br/>--task-timeout"] TOML["TOML<br/>global_timeout_seconds<br/>rollout_timeout_seconds<br/>task_timeout_seconds"] Legacy["Legacy aliases<br/>--timeout<br/>TOML timeout"] end CLI --> EvalConfig["EvalConfig"] TOML --> EvalConfig Legacy -->|"normalize to"| RolloutConfig["rollout_timeout_seconds"] RolloutConfig --> EvalConfig EvalConfig -->|"global_timeout_seconds"| Global["Global timeout<br/>timeout_after(global_timeout_seconds)"] Global -->|"wraps"| EvalRun["run_evaluation body<br/>env server start<br/>vf_env.evaluate(...)<br/>result metadata/save work"] EvalRun -->|"finally outside global timeout"| StopServer["Env server stop<br/>wait_for(..., 600s)"] EvalConfig -->|"rollout_timeout_seconds"| EnvKwargs["vf_env.set_kwargs(...)"] EnvKwargs --> V1Env["V1 Env<br/>runtime_state rollout_timeout_seconds"] EnvKwargs -->|"compat bridge"| V0Env["v0 Environment / MultiTurnEnv<br/>timeout_seconds"] V1Env -->|"single rollout"| HarnessRollout["Harness.run<br/>timeout_after(rollout_timeout_seconds)"] V1Env -->|"group rollout"| GroupRollout["Env._run_group_states<br/>same rollout timeout wraps gather + score_group"] GroupRollout --> HarnessRollout HarnessRollout --> RolloutIncluded["Inside rollout timeout<br/>setup_state<br/>run_program<br/>is_completed checks<br/>collect_artifacts<br/>update_rollout<br/>record generation timing<br/>score_rollout<br/>set completed"] GroupRollout --> GroupScore["Inside same group rollout timeout<br/>score_group uses remaining rollout time"] HarnessRollout -->|"outside rollout timeout"| RolloutExcluded["Outside rollout timeout<br/>cleanup_rollout<br/>cleanup_group<br/>finalize / strip handles<br/>env-server shutdown and log finalization"] EvalConfig -->|"task_timeout_seconds"| TaskTimeout["Task timeout<br/>active agent / program work"] TaskTimeout --> LocalCallable["local callable program"] TaskTimeout --> BaseLoop["base model/tool loop"] TaskTimeout --> LocalCommand["local command body"] TaskTimeout --> SandboxCommand["sandbox command body<br/>run_background_job timeout=None<br/>poll interval 3s"] TaskTimeout --> Nemo["NeMo Gym run_once"] Wiki["Wikispeedia<br/>sets task_timeout_seconds = 1200s"] --> TaskTimeout subgraph Guards["Hardcoded detailed guards"] ModelGuard["LLM/API no-response guard<br/>Runtime wait_for model response = 21600s<br/>ClientConfig HTTP timeout = 21600s<br/>connect timeout = 5s"] SandboxLifecycle["Sandbox lifecycle<br/>lease TTL = 1440 min<br/>create = 3600s<br/>ready = 3600s<br/>late-create cleanup wait = 3600s<br/>wait attempts = 120"] SandboxSetup["Sandbox setup and files<br/>package/setup/program setup command = 900s<br/>file transfer = 900s"] TestPhase["Test/scoring phase<br/>Harbor timeout_sec -> test_timeout_seconds<br/>default None<br/>RLM SWE test command default None"] RlmSweDetails["RLM SWE detail guards<br/>pycache shell timeout = 30s<br/>move tests = 60s<br/>archive/upload/download/cat = 300s"] OpenEnvGuard["OpenEnv guards<br/>startup = 300s<br/>poll = 1s<br/>health request = 10s<br/>schema request = 30s<br/>creation attempts = 120<br/>local probe = 5s"] ProcessCleanup["Local process cancel cleanup<br/>terminate wait = 5s"] EndpointWaits["Endpoint/tunnel loop waits<br/>1s ticks"] SandboxEndpoint["Sandbox endpoint helper HTTP<br/>300s"] end BaseLoop --> ModelGuard SandboxCommand --> SandboxLifecycle SandboxCommand --> SandboxSetup HarnessRollout --> TestPhase TestPhase --> RlmSweDetails HarnessRollout --> OpenEnvGuard LocalCommand --> ProcessCleanup SandboxCommand --> EndpointWaits SandboxCommand --> SandboxEndpoint subgraph Removed["Removed or no longer user-configurable timeout knobs"] RemovedSandbox["SandboxConfig<br/>timeout_minutes<br/>command_timeout<br/>install_timeout<br/>setup_timeout<br/>create_timeout<br/>wait_timeout<br/>poll_interval"] RemovedProgram["ProgramConfig<br/>setup_timeout"] RemovedHarness["Harness/env-specific knobs<br/>timeout_seconds<br/>agent timeout<br/>exec_timeout<br/>environment timeout<br/>provider_timeout_ms"] RemovedTaskset["Taskset/service probe knobs<br/>Harbor verifier default 900s<br/>OpenEnv startup/health/schema config fields"] end RemovedSandbox -->|"replaced by constants or task timeout"| SandboxLifecycle RemovedSandbox -->|"replaced by constants"| SandboxSetup RemovedProgram -->|"replaced by constants"| SandboxSetup RemovedHarness -->|"mapped to task timeout or removed"| TaskTimeout RemovedTaskset -->|"replaced by constants / explicit test_timeout_seconds"| TestPhaseNote
High Risk
Removes many per-config timeout knobs in favor of fixed internal caps and runtime timeouts, so workloads that depended on custom
command_timeoutor sandbox minutes may behave differently; changes touch harness, sandbox lifecycle, and eval orchestration.Overview
V1 timeout handling is refactored around three eval-facing knobs—
global_timeout_seconds,rollout_timeout_seconds, andtask_timeout_seconds—wired throughEvalConfig, envset_kwargs, and runtime state on harness runs. The eval CLI adds--rollout-timeout,--task-timeout, and--global-timeout; legacy--timeout/ TOMLtimeoutmap to rollout timeout.Framework enforcement uses a new
timeout_aftercontext manager andFrameworkTimeoutErrorin harness, v1Env, sandbox commands, NeMo Gym, and full eval runs. Expired limits settimed_out/task_timed_outandstop_conditiontimeout_reachedinstead of relying on per-layerasyncio.wait_forin environments.Per-config timeout fields are removed from
SandboxConfig,ProgramConfig, and many environment/harness configs (command_timeout,setup_timeout,timeout_minutes, harnesstimeout_seconds, etc.). Internal safety limits move totimeout_utils(sandbox create/ready/setup, model no-response, OpenEnv probes, Harbor verifier seconds viatest_timeout_seconds). Sandbox lease TTL is fixed at 24h; RLM config renames (rlm_tools,rlm_repo_url, …) and dropsRLM_EXEC_TIMEOUTwiring.Default HTTP client timeouts rise from 1h to 6h. Downstream envs drop explicit sandbox/bash timeouts; Wikispeedia sets
env.task_timeout_secondsat load time.Reviewed by Cursor Bugbot for commit 181e9ef. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Refactor V1 timeout handling to use centralized constants and runtime-state-driven timeouts
timeout_minutes,command_timeout,setup_timeout,exec_timeout, etc.) with centralized constants defined intimeout_utils.pyand runtime-state-propagated values (rollout_timeout_seconds,task_timeout_seconds,global_timeout_seconds).timeout_afterasync context manager inasync_utils.pyused throughout harness, env, and sandbox code to enforce rollout-, task-, and global-level timeouts that raiseTimeoutErrorand mark state astimed_out.--rollout-timeout,--task-timeout, and--global-timeoutCLI flags ineval.py; legacy--timeoutstill works but maps torollout_timeout_seconds.SandboxConfig,ProgramConfig,HarborTasksetConfig,OpenEnvTasksetConfig, and several environment-specific configs; downstream code now relies on the centralized constants.ClientConfigandEndpointClientConfigrequest timeout from 1 hour to 6 hours.command_timeout,setup_timeout,timeout_minutes) will no longer have those values honored; behavior now depends on the new centralized constants or runtime state.Changes since #1543 opened
FrameworkTimeoutErrorexception class inverifiers.utils.async_utilsand modifiedtimeout_aftercontext manager to raise it instead ofTimeoutErrorwhen framework-managed timeouts expire [adde70a]Harness.runrollout handler to catchverifiers.errors.Errorexceptions, record them viarecord_error, and continue to completion with scoring and state updates [adde70a]create_sandboxinverifiers.v1.utils.sandbox_utilsto handle creation timeouts by returning promptly and scheduling asynchronous cleanup via newdelete_late_sandbox_createfunction [adde70a]Env._run_group_statesto apply rollout timeout to both per-task harness run gathering and optionalscore_groupinvocation within a single timeout scope [adde70a]NeMoGymHarness._run_nemo_gym,Harness.generate_program,Harness.run_tool_program,Harness.command_program, andrun_sandbox_commandto catchFrameworkTimeoutErrorinstead ofTimeoutErrorfor framework-managed timeouts [adde70a]TimeoutErrorfrom frameworkFrameworkTimeoutError, error rollout scoring and completion, group timeout behavior, and sandbox creation timeout with late cleanup [adde70a]create_sandboxasync function to catch bothasyncio.TimeoutErrorandTimeoutError, and changed the exception raising behavior to throw a newTimeoutErrorwith the message "Sandbox creation timed out." while chaining the original exception [8ab88fd]delete_late_sandbox_createasync function to catch bothasyncio.TimeoutErrorandTimeoutErrorduring late sandbox creation cleanup [8ab88fd]Macroscope summarized 81a4506.